Kubernetes Privileged Containers and Namespaces

Recently, I was investigating the Linux namespace isolation between privileged pods and the host, and learned an interesting fact about how Kubernetes treats privileged pods at the Linux namespace level.

The following post is notes from a full day of trying to answer the question: What are the differences in isolation and capabilities between a privileged pod and a non-privileged pod with specific host namespace sharing configurations in Kubernetes?

TL;DR

Marking a pod as privileged: true does not instruct Kubernetes to place the Pod’s container into the same namespaces as the host. A privileged pod still has most of its namespaces isolated from the host, except for the user namespace (which I hope to touch on in a later post).

However, setting specific pod specs such as hostNetwork: true, hostPID: true, and hostIPC: true causes the pod to share those specific namespaces (net (and uts), pid, and ipc respectively) with the host.

What setting privileged: true does at the system level is grant the Pod’s containers a broad set of Linux capabilities, allowing them to perform privileged system operations. It does not share namespaces with the host the way hostNetwork/hostPID/hostIPC do.

Privileged Pod Investigation

Consider the following Pod manifest, whose only noteworthy field is privileged: true.

apiVersion: v1
kind: Pod
metadata:
  name: privileged-only
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true

Often, privileged pods are described as “dissolving all isolation between the pod and the node” or “Processes within the container get almost the same privileges that are available to processes outside a container” - KubeSec.io

This is a common oversimplification. It usually gets the point of “don’t use privileged pods wherever possible” across, but digging into these areas can lead to interesting insights into how the technology works.

The (not entirely correct) assumption that I made was that marking a pod as privileged: true simply instructs Kubernetes (or really the container runtime, via the Container Runtime Interface) to skip creating new namespaces for the pod. Reasonable assumption? I think so, which is why I was somewhat surprised to find that, when running the above Pod manifest, the only namespace the Pod shared with the host was the user namespace.

# Namespace: The linux namespace
# HOST-ID: The namespace ID of the host. 
# CONTAINER-ID: The namespace ID of the container
# NS-STATUS: ISOLATED == Pod's container is in its own namespace
#            SHARED   == Pod's container shares the host's namespace
NAMESPACE  HOST-ID                 CONTAINER-ID               NS-STATUS
cgroup     4026534899                4026535998               ISOLATED
ipc        4026534893                4026535503               ISOLATED
mnt        4026534836                4026535995               ISOLATED
net        4026534900                4026535505               ISOLATED
pid        4026534898                4026535997               ISOLATED
user       4026531837                4026531837               SHARED
uts        4026534891                4026535996               ISOLATED

A quick aside about the HOST-IDs and CONTAINER-IDs. These numbers are the inode numbers that identify each namespace in the kernel.

  1. When two processes have the same inode number for a particular namespace, it means they share that namespace.
  2. When they have different inode numbers, they’re “isolated” from each other by being in different namespaces.
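
To make the comparison rule concrete, here is a minimal sketch of it in bash. The helper names ns_id and ns_status are hypothetical (they are not part of the scripts in this post), and the inputs are written in the name:[inode] form that readlink prints for entries under /proc/<pid>/ns, using inode values from the table above.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of the original scripts.

# Extract the inode number from a namespace link such as "net:[4026534900]".
ns_id() {
  local link=$1
  link=${link#*\[}            # drop everything up to and including "["
  printf '%s\n' "${link%\]}"  # drop the trailing "]"
}

# Report whether two namespace links refer to the same namespace.
ns_status() {
  if [ "$(ns_id "$1")" = "$(ns_id "$2")" ]; then
    echo SHARED
  else
    echo ISOLATED
  fi
}

# Inode values taken from the privileged-only table above.
ns_status 'net:[4026534900]' 'net:[4026535505]'    # prints ISOLATED
ns_status 'user:[4026531837]' 'user:[4026531837]'  # prints SHARED
```

On a Linux machine, readlink /proc/$$/ns/net prints the link for the current shell, which can be passed straight to ns_status.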

I wanted to do some testing to figure out what exactly was going on.

If a privileged Pod’s container isn’t sharing namespaces with the host, do the host namespace settings in the Pod manifest actually share them? For example, does hostNetwork: true in the following manifest share the net namespace with the host?

# Example
apiVersion: v1
kind: Pod
metadata:
  name: network-shared
spec:
  hostNetwork: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]

A helping hand

We could test this by manually writing Pod manifests with our desired specifications, but it’s painful, time-consuming, and error prone. The following script allows us to check each individual Pod spec that controls these namespaces.

#!/bin/bash

# Default flag values
DO_PID=false
DO_NETWORK=false
DO_IPC=false

# Parse command line arguments
while [[ $# -gt 0 ]]; do
  key="$1"
  case $key in
    --pid)
      DO_PID=true
      shift
      ;;
    --hostnetwork)
      DO_NETWORK=true
      shift
      ;;
    --ipc)
      DO_IPC=true
      shift
      ;;
    --help)
      echo "Usage: $0 [--pid] [--hostnetwork] [--ipc]"
      echo "Specify which namespace sharing options to test"
      exit 0
      ;;
    *)
      echo "Unknown option: $1"
      echo "Usage: $0 [--pid] [--hostnetwork] [--ipc]"
      exit 1
      ;;
  esac
done

# If no flags specified, show usage
if ! $DO_PID && ! $DO_NETWORK && ! $DO_IPC; then
  echo "Usage: $0 [--pid] [--hostnetwork] [--ipc]"
  echo "At least one option must be specified"
  exit 1
fi

# Start minikube if not running
minikube status &> /dev/null || minikube start
# Wait for default service account
while ! kubectl get serviceaccount default &> /dev/null; do
  sleep 2
done

# Function to check pod namespaces
check_pod() {
  NAME=$1
  echo "POD: $NAME"
  # Wait for pod to be ready
  while [ "$(kubectl get pod $NAME -o 'jsonpath={.status.phase}' 2>/dev/null)" != "Running" ]; do
    sleep 2
  done
  # Get container ID
  CONTAINER_ID=$(kubectl get pod $NAME -o 'jsonpath={.status.containerStatuses[0].containerID}' | sed 's/docker:\/\///')
  # Check namespaces
  minikube ssh "
    CONTAINER_PID=\$(sudo docker inspect --format='{{.State.Pid}}' $CONTAINER_ID)
    echo 'NAMESPACE  HOST-ID                 CONTAINER-ID               NS-STATUS'
    for NS in cgroup ipc mnt net pid user uts; do
      HOST_NS=\$(sudo readlink /proc/1/ns/\$NS)
      CONTAINER_NS=\$(sudo readlink /proc/\$CONTAINER_PID/ns/\$NS)
      HOST_ID=\$(echo \$HOST_NS | sed 's/.*\\[\\(.*\\)\\]/\\1/')
      CONTAINER_ID=\$(echo \$CONTAINER_NS | sed 's/.*\\[\\(.*\\)\\]/\\1/')
      SHARED=\$([ \"\$HOST_NS\" = \"\$CONTAINER_NS\" ] && echo 'SHARED' || echo 'ISOLATED')
      printf \"%-10s %-25s %-24s %s\\n\" \"\$NS\" \"\$HOST_ID\" \"\$CONTAINER_ID\" \"\$SHARED\"
    done
  "
  echo ""
}

# Create shared network pod if requested
if $DO_NETWORK; then
  echo "=== hostNetwork ==="
  POD1=$(cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: network-shared
spec:
  hostNetwork: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
EOF
  )
  echo "$POD1"
  echo "$POD1" | kubectl apply -f -
  check_pod "network-shared"
  kubectl delete pod network-shared --force &> /dev/null
fi

# Create shared PID pod if requested
if $DO_PID; then
  echo "=== hostPID ==="
  POD2=$(cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pid-shared
spec:
  hostPID: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
EOF
  )
  echo "$POD2"
  echo "$POD2" | kubectl apply -f -
  check_pod "pid-shared"
  kubectl delete pod pid-shared --force &> /dev/null
fi

# Create shared IPC pod if requested
if $DO_IPC; then
  echo "=== hostIPC ==="
  POD3=$(cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ipc-shared
spec:
  hostIPC: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
EOF
  )
  echo "$POD3"
  echo "$POD3" | kubectl apply -f -
  check_pod "ipc-shared"
  kubectl delete pod ipc-shared --force &> /dev/null
fi
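
One assumption baked into the script is worth calling out: the sed call only strips a docker:// prefix, and the PID lookup uses docker inspect, so it expects a minikube node running the Docker runtime. Kubernetes reports the container ID as <runtime>://<id>, so the prefix can be stripped for any runtime with a parameter expansion. This is a small sketch; container_id_only is a hypothetical helper name and the IDs are made up.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of checkNS.sh.

# Strip the "<runtime>://" prefix from a Kubernetes containerID value.
container_id_only() {
  printf '%s\n' "${1#*://}"
}

# Made-up IDs, shortened for readability.
container_id_only 'docker://0123abcd'      # prints 0123abcd
container_id_only 'containerd://0123abcd'  # prints 0123abcd
```

This only fixes the prefix handling. On a node that does not run Docker, the docker inspect lookup would still need to be swapped for that runtime's own tooling.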

Running ./checkNS.sh --ipc --pid --hostnetwork tests each of the big three pod specifications I’m interested in. The results are shown below.

Results of checkNS.sh

Let’s dig into each of these.

hostNetwork

The Pod being applied is only setting hostNetwork: true.

apiVersion: v1
kind: Pod
metadata:
  name: shared
spec:
  hostNetwork: true # Here
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]

Interestingly, the net, uts, and user namespaces are now being shared between the host and the container. The net namespace makes sense as we explicitly set hostNetwork: true.

The interesting observation is that setting hostNetwork: true also shares the uts namespace between the container and the host. I’m not 100% sure why this is, but it seems plausible, as the uts namespace holds the hostname and domain name, which are closely tied to the network identity of the machine.

NAMESPACE  HOST-ID                 CONTAINER-ID               NS-STATUS
cgroup     4026534899                4026536006               ISOLATED
ipc        4026534893                4026536001               ISOLATED
mnt        4026534836                4026536004               ISOLATED
net        4026534900                4026534900               SHARED
pid        4026534898                4026536005               ISOLATED
user       4026531837                4026531837               SHARED
uts        4026534891                4026534891               SHARED

hostNetwork demonstration

Exec-ing into the Pod’s container lets us inspect resources in the host’s network namespace from inside the container. Running ss -tuln | grep LISTEN both on the host and inside the Pod’s container should show the same listening sockets.

#!/bin/bash

# Start minikube if not running
minikube status &> /dev/null || minikube start

# Silently install iproute2 on minikube node if ss is not available
minikube ssh "command -v ss >/dev/null 2>&1 || { sudo apt-get update -qq && sudo apt-get install -qq iproute2 >/dev/null; }" 

# Create hostNetwork pod with a network tools image instead of busybox
echo "=== Creating pod with hostNetwork ==="
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: network-shared
spec:
  hostNetwork: true
  containers:
    - name: network-test
      image: nicolaka/netshoot # Note we changed the image
      command: ["sleep", "3600"]
EOF

# Wait for pod to be ready
echo "Waiting for pod to be running..."
while [ "$(kubectl get pod network-shared -o 'jsonpath={.status.phase}' 2>/dev/null)" != "Running" ]; do
  sleep 2
done

# Check namespace to confirm network is shared
CONTAINER_ID=$(kubectl get pod network-shared -o 'jsonpath={.status.containerStatuses[0].containerID}' | sed 's/docker:\/\///')
minikube ssh "
  CONTAINER_PID=\$(sudo docker inspect --format='{{.State.Pid}}' $CONTAINER_ID)
  echo 'NAMESPACE  HOST-ID                 CONTAINER-ID               NS-STATUS'
  HOST_NS=\$(sudo readlink /proc/1/ns/net)
  CONTAINER_NS=\$(sudo readlink /proc/\$CONTAINER_PID/ns/net)
  HOST_ID=\$(echo \$HOST_NS | sed 's/.*\\[\\(.*\\)\\]/\\1/')
  CONTAINER_ID=\$(echo \$CONTAINER_NS | sed 's/.*\\[\\(.*\\)\\]/\\1/')
  SHARED=\$([ \"\$HOST_NS\" = \"\$CONTAINER_NS\" ] && echo 'SHARED' || echo 'ISOLATED')
  printf \"%-10s %-25s %-24s %s\\n\" \"net\" \"\$HOST_ID\" \"\$CONTAINER_ID\" \"\$SHARED\"
"

# Compare listening ports using ss
echo -e "\n=== Ports listening on host ==="
minikube ssh "ss -tuln | grep LISTEN"

echo -e "\n=== Ports visible from container ==="
kubectl exec -it network-shared -- ss -tuln | grep LISTEN

# Clean up
echo -e "\n=== Cleaning up ==="
kubectl delete pod network-shared --force

Results of hostNetwork.sh
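
The script above switches to the netshoot image because it needs ss. As a rough alternative that avoids the dependency, the listening ports can be read directly from /proc/net/tcp, where the fourth column holds the socket state (0A is LISTEN) and the port is the hex number after the colon in the local address. This is a sketch only; listening_tcp_ports is a hypothetical helper, and it covers IPv4 TCP sockets only (IPv6 sockets live in /proc/net/tcp6).

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of hostNetwork.sh.

# Print the decimal port of every IPv4 TCP socket in LISTEN state (st == 0A).
# $1 is the table to read; it defaults to /proc/net/tcp.
listening_tcp_ports() {
  local table=${1:-/proc/net/tcp} hex_port
  awk 'NR > 1 && $4 == "0A" { split($2, addr, ":"); print addr[2] }' "$table" |
    while read -r hex_port; do
      echo $((0x$hex_port))   # convert the hex port to decimal
    done | sort -n -u
}

# Only attempt the live read where the table exists (Linux).
if [ -r /proc/net/tcp ]; then
  listening_tcp_ports
fi
```

Running the function on the node and inside a hostNetwork: true pod should give matching lists, while a pod with its own net namespace should only list its own sockets.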

hostPID

The next pod being applied is only setting hostPID: true.

apiVersion: v1
kind: Pod
metadata:
  name: shared
spec:
  hostPID: true # Here
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]

Setting hostPID: true shares the pid namespace between the host and the container, allowing for the container to interact with processes on the host. No surprises there.

NAMESPACE  HOST-ID                 CONTAINER-ID               NS-STATUS
cgroup     4026534899                4026536126               ISOLATED
ipc        4026534893                4026536063               ISOLATED
mnt        4026534836                4026536124               ISOLATED
net        4026534900                4026536064               ISOLATED
pid        4026534898                4026534898               SHARED
user       4026531837                4026531837               SHARED
uts        4026534891                4026536125               ISOLATED

hostPID Demonstration

Exec-ing into the pod would allow us to see the host’s processes by running ps aux. Here is a modification of our earlier script to demonstrate that. Note the output is showing processes from the node.

#!/bin/bash

# Start minikube if not running
minikube status &> /dev/null || minikube start

# Wait for default service account
while ! kubectl get serviceaccount default &> /dev/null; do
  sleep 2
done

# Create hostPID pod
echo "=== Creating pod with hostPID ==="
POD=$(cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pid-shared
spec:
  hostPID: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
EOF
)

echo "$POD" | kubectl apply -f -

# Wait for pod to be ready
echo "Waiting for pod to be running..."
while [ "$(kubectl get pod pid-shared -o 'jsonpath={.status.phase}' 2>/dev/null)" != "Running" ]; do
  sleep 2
done

# Look up the container ID so its namespaces can be compared with the host's
echo "Checking pod namespaces..."
CONTAINER_ID=$(kubectl get pod pid-shared -o 'jsonpath={.status.containerStatuses[0].containerID}' | sed 's/docker:\/\///')

# Check namespaces
minikube ssh "
  CONTAINER_PID=\$(sudo docker inspect --format='{{.State.Pid}}' $CONTAINER_ID)
  echo 'NAMESPACE  HOST-ID                 CONTAINER-ID               NS-STATUS'
  for NS in cgroup ipc mnt net pid user uts; do
    HOST_NS=\$(sudo readlink /proc/1/ns/\$NS)
    CONTAINER_NS=\$(sudo readlink /proc/\$CONTAINER_PID/ns/\$NS)
    HOST_ID=\$(echo \$HOST_NS | sed 's/.*\\[\\(.*\\)\\]/\\1/')
    CONTAINER_ID=\$(echo \$CONTAINER_NS | sed 's/.*\\[\\(.*\\)\\]/\\1/')
    SHARED=\$([ \"\$HOST_NS\" = \"\$CONTAINER_NS\" ] && echo 'SHARED' || echo 'ISOLATED')
    printf \"%-10s %-25s %-24s %s\\n\" \"\$NS\" \"\$HOST_ID\" \"\$CONTAINER_ID\" \"\$SHARED\"
  done
"

# Execute into the pod and run ps aux
echo -e "\n=== Executing into pod to run ps aux ==="
kubectl exec -it pid-shared -- ps aux

# Clean up
echo -e "\n=== Cleaning up ==="
kubectl delete pod pid-shared --force

Creating the pod and viewing host processes on the node from the container
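
A quicker sanity check than scanning the full ps aux output is to look at what PID 1 is from inside the container. The sketch below assumes a Linux /proc layout; pid1_name is a hypothetical helper that is not part of the scripts in this post.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of the original scripts.

# Print the command name of PID 1 as seen from the current PID namespace.
# $1 is the proc mount to read; it defaults to /proc.
pid1_name() {
  local proc_root=${1:-/proc}
  cat "$proc_root/1/comm"
}

# Only attempt the live read where /proc/1/comm is readable.
if [ -r /proc/1/comm ]; then
  echo "PID 1 here is: $(pid1_name)"
fi
```

In a container with its own pid namespace, I would expect PID 1 to be the container's own command (sleep for the busybox pods used here). With hostPID: true, it should be the node's init process instead.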

hostIPC

The next pod being applied is only setting hostIPC: true.

apiVersion: v1
kind: Pod
metadata:
  name: shared
spec:
  hostIPC: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]

Setting hostIPC: true makes the pod share the host’s ipc namespace, which allows containers to get up to various shenanigans such as accessing the host’s shared memory segments and interfering with its semaphores and message queues. Again, no surprises here: we see the ipc namespace is now shared.

NAMESPACE  HOST-ID                 CONTAINER-ID               NS-STATUS
cgroup     4026534899                4026536127               ISOLATED
ipc        4026534893                4026534893               SHARED
mnt        4026534836                4026536124               ISOLATED
net        4026534900                4026536064               ISOLATED
pid        4026534898                4026536126               ISOLATED
user       4026531837                4026531837               SHARED
uts        4026534891                4026536125               ISOLATED

IPC Demonstration

Exec-ing into the pod shows that the host’s IPC resources are visible inside the container. Running ipcs -q on the host (with sudo) and inside the container lists the same message queues, demonstrating the shared namespace. You can read more about message queues here.

#!/bin/bash

# Start minikube if not running
minikube status &> /dev/null || minikube start

# Create hostIPC pod
echo "=== Creating pod with hostIPC ==="
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ipc-shared
spec:
  hostIPC: true
  containers:
  - name: ipc-test
    image: busybox
    command: ["sleep", "3600"]
EOF

# Wait for pod to be ready
echo "Waiting for pod to be running..."
while [ "$(kubectl get pod ipc-shared -o 'jsonpath={.status.phase}' 2>/dev/null)" != "Running" ]; do
  sleep 2
done

# Show only message queues on host
echo -e "\n=== Message Queues visible on host ==="
minikube ssh "sudo ipcs -q"

# Show only message queues in container
echo -e "\n=== Message Queues visible in container ==="
kubectl exec -it ipc-shared -- ipcs -q

# Clean up
echo -e "\n=== Cleaning up ==="
kubectl delete pod ipc-shared --force

Viewing hostIPC info such as message queues
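
One caveat with comparing ipcs -q output is that both lists may simply be empty, which makes it hard to tell a shared namespace from an isolated one. The sketch below counts System V message queues by reading /proc/sysvipc/msg, whose first line is a column header. count_msg_queues is a hypothetical helper, and the suggestion to create a queue first assumes util-linux's ipcmk is available on the node.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of the original scripts.

# Count System V message queues visible in the current ipc namespace.
# $1 is the table to read; it defaults to /proc/sysvipc/msg.
count_msg_queues() {
  local table=${1:-/proc/sysvipc/msg}
  # Skip the header line and count the remaining rows.
  awk 'NR > 1 { n++ } END { print n + 0 }' "$table"
}

# Only attempt the live read where the table exists.
if [ -r /proc/sysvipc/msg ]; then
  echo "Message queues visible here: $(count_msg_queues)"
fi
```

Creating a queue on the node first (for example with ipcmk -Q) and then comparing the counts on the node and in the container gives a non-empty result to compare against.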

Privileged vs All Specs

Now that we’ve tested the big three pod specs, let’s compare our original privileged Pod to a Pod manifest where we explicitly set the spec to share certain namespaces.

=== PRIVILEGED ONLY ===
apiVersion: v1
kind: Pod
metadata:
  name: privileged-only
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true
      
NAMESPACE  HOST-ID                 CONTAINER-ID             STATUS
cgroup     4026534899                4026536059               ISOLATED
ipc        4026534893                4026535975               ISOLATED
mnt        4026534836                4026536056               ISOLATED
net        4026534900                4026535982               ISOLATED
pid        4026534898                4026536058               ISOLATED
user       4026531837                4026531837               SHARED
uts        4026534891                4026536057               ISOLATED

=== SHARED NAMESPACES ===
apiVersion: v1
kind: Pod
metadata:
  name: shared
spec:
  hostNetwork: true
  hostPID: true
  hostIPC: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
    
NAMESPACE  HOST-ID                 CONTAINER-ID             STATUS
cgroup     4026534899                4026536064               ISOLATED
ipc        4026534893                4026534893               SHARED
mnt        4026534836                4026536063               ISOLATED
net        4026534900                4026534900               SHARED
pid        4026534898                4026534898               SHARED
user       4026531837                4026531837               SHARED
uts        4026534891                4026534891               SHARED

We have verified that explicitly setting the pod specifications that share host namespaces (hostNetwork: true, hostPID: true, and hostIPC: true) does place the Pod’s container into the same namespaces as the host, and that running a pod as privileged does not. Why?

Digging deeper through the Kubernetes documentation gave some subtle hints as to what is actually going on:

“Any container in a Pod can enable Privileged mode if you set the privileged: true field in the securityContext field for the container. Privileged containers override or undo many other hardening settings such as the applied seccomp profile, AppArmor profile, or SELinux constraints. Privileged containers are given all Linux capabilities, including capabilities that they don’t require. For example, a root user in a privileged container might be able to use the CAP_SYS_ADMIN and CAP_NET_ADMIN capabilities on the node, bypassing the runtime seccomp configuration and other restrictions.” - Kubernetes Documentation

How are pods marked as Privileged able to access resources on the host if they’re “isolated” in their own namespaces?

Capabilities

It turns out, the answer is: capabilities.

While this post isn’t a deep dive into Linux capabilities, it might be helpful to briefly recap what capabilities are in Linux.

Capabilities are granular “permissions” that can be assigned to programs, granting very specific privileges that are otherwise reserved for the root user. This is great from a principle-of-least-privilege perspective, as it reduces the attack surface if an attacker is able to tamper with a program.

In a world without capabilities, if a user wanted to run a webserver on port 80, that process would have to be run as root because binding to ports under 1024 is a “privileged” action that must be done by root.

With the CAP_NET_BIND_SERVICE capability, the user can take that one very specific “rootly” power of binding to a privileged port and give it to the program, instead of running the webserver as root. This is done with the sudo setcap cap_net_bind_service=+ep webserver_program command.

Demo of adding cap_net_bind

#!/bin/bash

# Set text colors
RED='\033[0;31m'
GREEN='\033[0;32m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

echo -e "${BLUE}=== CAP_NET_BIND_SERVICE Demonstration ===${NC}"
echo "Testing ability to bind to privileged port 80"

# Setup
TEMP_DIR=$(mktemp -d)
cd "$TEMP_DIR"
cp "$(command -v nc)" ./nc_no_caps

# Test 1: Without capability
echo -e "\n${BLUE}Test 1: Without capability${NC}"
echo -e "$ ./nc_no_caps -l 80"
./nc_no_caps -l 80 &
PID=$!
sleep 1
if kill -0 $PID 2>/dev/null; then
  echo -e "${GREEN}Unexpectedly succeeded${NC}"
  kill $PID
else
  echo -e "${RED}Failed as expected - Permission denied${NC}"
fi

# Test 2: With capability
echo -e "\n${BLUE}Test 2: With CAP_NET_BIND_SERVICE capability${NC}"
echo -e "$ sudo setcap cap_net_bind_service=+ep ./nc_no_caps"
sudo setcap cap_net_bind_service=+ep ./nc_no_caps
echo -e "$ getcap ./nc_no_caps"
getcap ./nc_no_caps

echo -e "$ ./nc_no_caps -l 80"
./nc_no_caps -l 80 &
PID=$!
sleep 1
if kill -0 $PID 2>/dev/null; then
  echo -e "${GREEN}Success! Process is running${NC}"
  
  # Verify binding
  echo -e "$ ss -tuln | grep :80"
  ss -tuln | grep :80
  
  echo -e "Killing process: $ kill $PID"
  kill $PID
else
  echo -e "${RED}Failed to run - port 80 may already be in use${NC}"
fi

# Clean up
echo -e "\n${BLUE}Cleaning up${NC}"
echo -e "$ cd ~"
cd ~
echo -e "$ rm -rf $TEMP_DIR"
rm -rf "$TEMP_DIR"

echo -e "\n${GREEN}Demonstration complete!${NC}"

Getting a list of capabilities can be done by running grep -E '^#define CAP_' /usr/include/linux/capability.h. Currently there are ~40.
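
The kernel also exposes the capability sets of every process as hex masks in /proc/<pid>/status. As a small sketch (cap_field is a hypothetical helper, not part of the scripts in this post), a single field can be pulled out like this:

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of the original scripts.

# Print one capability mask (CapInh, CapPrm, CapEff, CapBnd or CapAmb)
# from a status file. $2 defaults to the status file of the current shell.
cap_field() {
  local field=$1 status_file=${2:-/proc/$$/status}
  awk -v key="$field:" '$1 == key { print $2 }' "$status_file"
}

# Only attempt the live read where the status file exists (Linux).
if [ -r "/proc/$$/status" ]; then
  echo "CapEff of this shell: $(cap_field CapEff)"
fi
```

If libcap's capsh tool is installed, capsh --decode=<mask> turns such a mask into capability names.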

The end of the rabbit hole

Now that we have briefly touched on Linux capabilities, let’s see how they’re used to answer the question: How are pods marked as Privileged able to access resources on the host if they’re “isolated” in their own namespaces?

The key is that there are two separate mechanisms at work:

  1. Namespace sharing (hostNetwork, hostPID, hostIPC) - These settings determine which namespaces a pod shares with the host, providing specific types of access.
  2. Linux capabilities - These determine what privileged operations a container can perform, regardless of namespace isolation.
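
To make the second mechanism concrete: each capability is one bit in a hex mask, so checking for a capability is a bit test. The sketch below uses plain bash arithmetic (not the bc-based decoding used elsewhere in this post). has_cap is a hypothetical helper, and the two masks are illustrative values rather than output from my test runs: a80425fb corresponds to the default capability set Docker grants, and 000001ffffffffff has bits 0 through 40 set.

```shell
#!/bin/bash
# Hypothetical helper for illustration; not part of the original scripts.

# Succeed if capability number $2 is set in the hex mask $1.
has_cap() {
  local mask=$1 bit=$2
  (( (16#$mask >> bit) & 1 ))
}

# Illustrative masks (assumptions, not taken from the test runs in this post).
default_mask=00000000a80425fb
privileged_mask=000001ffffffffff

for mask in "$default_mask" "$privileged_mask"; do
  # Bit 21 is CAP_SYS_ADMIN.
  if has_cap "$mask" 21; then
    echo "$mask: CAP_SYS_ADMIN present"
  else
    echo "$mask: CAP_SYS_ADMIN absent"
  fi
done
```

With these two masks, the loop reports CAP_SYS_ADMIN as absent for the first and present for the second, which is the kind of difference I would expect between a default pod and a privileged one.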

One final time, let’s write a script that makes this much easier to see. This is a long one, but essentially we want to get all the capabilities assigned to each container.

#!/bin/bash

# Define color codes
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
PURPLE='\033[0;35m'
CYAN='\033[0;36m'
WHITE='\033[0;37m'
BOLD='\033[1m'
BG_RED='\033[41m'
BG_GREEN='\033[42m'
BG_BLUE='\033[44m'
NC='\033[0m' # No Color

# Function for colorful section headers
print_header() {
  echo -e "\n${BG_BLUE}${BOLD}${WHITE} $1 ${NC}"
  echo -e "${BLUE}$(printf '=%.0s' {1..70})${NC}"
}

# Function for sub-headers
print_subheader() {
  echo -e "\n${BOLD}${CYAN} $1 ${NC}"
  echo -e "${CYAN}$(printf '-%.0s' {1..70})${NC}"
}

# Function to show status
show_status() {
  if [ "$2" == "OK" ]; then
    echo -e "[$1] ${GREEN}$3${NC}"
  elif [ "$2" == "WARN" ]; then
    echo -e "[$1] ${YELLOW}$3${NC}"
  else
    echo -e "[$1] ${RED}$3${NC}"
  fi
}

# Start minikube if not running
print_header "KUBERNETES SECURITY TEST SUITE"
echo -e "${YELLOW}Initializing test environment...${NC}"

show_status "SETUP" "WARN" "Checking if minikube is running"
if minikube status &> /dev/null; then
  show_status "SETUP" "OK" "Minikube is running"
else
  show_status "SETUP" "WARN" "Starting minikube"
  minikube start
  show_status "SETUP" "OK" "Minikube started successfully"
fi

# Wait for default service account
show_status "SETUP" "WARN" "Waiting for default service account"
while ! kubectl get serviceaccount default &> /dev/null; do
  sleep 2
  echo -n -e "${YELLOW}.${NC}"
done
show_status "SETUP" "OK" "Default service account is available"

# Ensure bc is installed on minikube node
show_status "SETUP" "WARN" "Checking if bc is installed in minikube"
minikube ssh "command -v bc &> /dev/null || { echo 'Installing bc...'; sudo apt-get update && sudo apt-get install -y bc; }"
show_status "SETUP" "OK" "bc is available in minikube"

# Function to check namespaces and capabilities
check_container() {
  POD_NAME=$1
  echo "CHECKING CONTAINER: $POD_NAME"
  
  # Get container ID
  CONTAINER_ID=$(kubectl get pod $POD_NAME -o 'jsonpath={.status.containerStatuses[0].containerID}' | sed 's/docker:\/\///')
  echo -e "${BOLD}Container ID:${NC} ${YELLOW}$CONTAINER_ID${NC}"
  
  # Check namespaces and compare with host
  minikube ssh "
    # Get container PID
    CONTAINER_PID=\$(sudo docker inspect --format='{{.State.Pid}}' $CONTAINER_ID)
    echo -e \"Container PID on host: \${CONTAINER_PID}\"
    echo \"\"
    
    # Compare namespaces
    echo -e \"\033[1;36m--- Namespace Comparison ---\033[0m\"
    echo -e \"\033[1mNAMESPACE  HOST-ID                 CONTAINER-ID             STATUS\033[0m\"
    echo -e \"\033[36m---------- ----------------------- ------------------------ --------\033[0m\"
    for NS in cgroup ipc mnt net pid user uts; do
      HOST_NS=\$(sudo readlink /proc/1/ns/\$NS)
      CONTAINER_NS=\$(sudo readlink /proc/\$CONTAINER_PID/ns/\$NS)
      
      # Extract just the ID number for cleaner display
      HOST_ID=\$(echo \$HOST_NS | sed 's/.*\\[\\(.*\\)\\]/\\1/')
      CONTAINER_ID=\$(echo \$CONTAINER_NS | sed 's/.*\\[\\(.*\\)\\]/\\1/')
      
      SHARED=\$([ \"\$HOST_NS\" = \"\$CONTAINER_NS\" ] && echo -e \"\033[31mSHARED\033[0m\" || echo -e \"\033[32mISOLATED\033[0m\")
      printf \"%-10s %-25s %-24s %b\\n\" \"\$NS\" \"\$HOST_ID\" \"\$CONTAINER_ID\" \"\$SHARED\"
    done
    
    echo \"\"
    echo -e \"\033[1;36m--- Capability Information ---\033[0m\"
    echo \"Checking capabilities from /proc/\$CONTAINER_PID/status:\"
    echo \"\"
    echo -e \"\033[1;33m### Container Capabilities ###\033[0m\"
    CONT_CAP_EFF=\$(sudo grep CapEff /proc/\$CONTAINER_PID/status | awk '{print \$2}')
    CONT_CAP_PRM=\$(sudo grep CapPrm /proc/\$CONTAINER_PID/status | awk '{print \$2}')
    CONT_CAP_INH=\$(sudo grep CapInh /proc/\$CONTAINER_PID/status | awk '{print \$2}')
    CONT_CAP_BND=\$(sudo grep CapBnd /proc/\$CONTAINER_PID/status | awk '{print \$2}')
    CONT_CAP_AMB=\$(sudo grep CapAmb /proc/\$CONTAINER_PID/status | awk '{print \$2}')
    
    echo -e \"\033[1mCapEff (Effective):  \033[0m \033[33m\$CONT_CAP_EFF\033[0m\"
    echo -e \"\033[1mCapPrm (Permitted):  \033[0m \033[33m\$CONT_CAP_PRM\033[0m\"
    echo -e \"\033[1mCapInh (Inheritable):\033[0m \033[33m\$CONT_CAP_INH\033[0m\"
    echo -e \"\033[1mCapBnd (Bounding):   \033[0m \033[33m\$CONT_CAP_BND\033[0m\"
    echo -e \"\033[1mCapAmb (Ambient):    \033[0m \033[33m\$CONT_CAP_AMB\033[0m\"
    
    # Save effective capabilities for summary comparison
    echo \$CONT_CAP_EFF > /tmp/${POD_NAME}_capeff
    
    # Decode effective capabilities 
    echo \"\"
    echo -e \"\033[1;33m### Decoded Effective Capabilities ###\033[0m\"
    echo \"The effective capabilities (CapEff) determine what privileged operations a process can actually perform.\"
    echo \"\"
    
    # Using bc to decode capabilities
    echo \"Using bc to decode capabilities...\"
    
    # Using a simplified list of common capabilities for demonstration
    cap_map=(
      \"0:CAP_CHOWN - Change file ownership and group\"
      \"1:CAP_DAC_OVERRIDE - Bypass file read, write, and execute permission checks\"
      \"2:CAP_DAC_READ_SEARCH - Bypass file read permission checks\"
      \"3:CAP_FOWNER - Bypass permission checks on operations that normally require the file system UID to match the process's\"
      \"4:CAP_FSETID - Don't clear set-user-ID and set-group-ID mode bits when a file is modified\"
      \"5:CAP_KILL - Bypass permission checks for sending signals\"
      \"6:CAP_SETGID - Make arbitrary manipulations of process GIDs\"
      \"7:CAP_SETUID - Make arbitrary manipulations of process UIDs\"
      \"8:CAP_SETPCAP - Modify process capabilities\"
      \"9:CAP_LINUX_IMMUTABLE - Set the FS_APPEND_FL and FS_IMMUTABLE_FL flags\"
      \"10:CAP_NET_BIND_SERVICE - Bind a socket to Internet domain privileged ports\"
      \"11:CAP_NET_BROADCAST - Make socket broadcasts and listen to multicasts\"
      \"12:CAP_NET_ADMIN - Perform network administration tasks\"
      \"13:CAP_NET_RAW - Use raw sockets\"
      \"14:CAP_IPC_LOCK - Lock memory\"
      \"15:CAP_IPC_OWNER - Bypass permissions on message queues and shared memory\"
      \"16:CAP_SYS_MODULE - Load and unload kernel modules\"
      \"17:CAP_SYS_RAWIO - Perform I/O port operations\"
      \"18:CAP_SYS_CHROOT - Use chroot()\"
      \"19:CAP_SYS_PTRACE - Trace arbitrary processes\"
      \"20:CAP_SYS_PACCT - Configure process accounting\"
      \"21:CAP_SYS_ADMIN - Perform various system administration operations\"
      \"22:CAP_SYS_BOOT - Use reboot() and kexec_load()\"
      \"23:CAP_SYS_NICE - Raise process nice value and change nice value for arbitrary processes\"
      \"24:CAP_SYS_RESOURCE - Override resource limits\"
      \"25:CAP_SYS_TIME - Set system clock\"
      \"26:CAP_SYS_TTY_CONFIG - Configure TTY devices\"
      \"27:CAP_MKNOD - Create special files\"
      \"28:CAP_LEASE - Establish leases on files\"
      \"29:CAP_AUDIT_WRITE - Write records to kernel auditing log\"
      \"30:CAP_AUDIT_CONTROL - Configure audit subsystem\"
      \"31:CAP_SETFCAP - Set file capabilities\"
      \"32:CAP_MAC_OVERRIDE - Override MAC restrictions\"
      \"33:CAP_MAC_ADMIN - Configure MAC\"
      \"34:CAP_SYSLOG - Perform privileged syslog operations\"
      \"35:CAP_WAKE_ALARM - Trigger wake alarms\"
      \"36:CAP_BLOCK_SUSPEND - Block system suspend\"
      \"37:CAP_AUDIT_READ - Read audit log\"
    )
    
    # Convert hex to binary
    binary=\$(printf \"%064s\" \$(bc <<< \"obase=2;ibase=16;\${CONT_CAP_EFF^^}\") | tr ' ' '0')
    
    # Reverse for bit position
    reversed_binary=\$(echo \$binary | rev)
    
    echo -e \"Capability hex: \033[36m\$CONT_CAP_EFF\033[0m\"
    
    # Check each bit and display capability if set
    for cap in \"\${cap_map[@]}\"; do
      pos=\$(echo \$cap | cut -d':' -f1)
      desc=\$(echo \$cap | cut -d':' -f2-)
      
      if [ \"\${reversed_binary:\$pos:1}\" = \"1\" ]; then
        echo -e \" - \033[32m\$desc\033[0m\"
      fi
    done
  "
  echo ""
}

print_header "KUBERNETES NAMESPACE & CAPABILITY TESTS"

# 1. Create a baseline pod (no special privileges)
echo "CREATING POD 1: baseline-pod (no special privileges)"
POD_MANIFEST=$(cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: baseline-pod
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
EOF
)
echo "$POD_MANIFEST" | kubectl apply -f -
show_status "POD" "OK" "Baseline pod created"

# 2. Create privileged pod
echo "CREATING POD 2: privileged-pod (privileged: true)"
POD_MANIFEST=$(cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true
EOF
)
echo "$POD_MANIFEST" | kubectl apply -f -
show_status "POD" "OK" "Privileged pod created"

# 3. Create hostNetwork pod
echo "CREATING POD 3: hostnetwork-pod (hostNetwork: true)"
POD_MANIFEST=$(cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hostnetwork-pod
spec:
  hostNetwork: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
EOF
)
echo "$POD_MANIFEST" | kubectl apply -f -
show_status "POD" "OK" "HostNetwork pod created"

# 4. Create hostPID pod
echo "CREATING POD 4: hostpid-pod (hostPID: true)"
POD_MANIFEST=$(cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hostpid-pod
spec:
  hostPID: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
EOF
)
echo "$POD_MANIFEST" | kubectl apply -f -
show_status "POD" "OK" "HostPID pod created"

# 5. Create hostIPC pod
echo "CREATING POD 5: hostipc-pod (hostIPC: true)"
POD_MANIFEST=$(cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hostipc-pod
spec:
  hostIPC: true
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
EOF
)
echo "$POD_MANIFEST" | kubectl apply -f -
show_status "POD" "OK" "HostIPC pod created"

# Wait for all pods to be ready
echo -e "\n${YELLOW}Waiting for pods to be ready...${NC}"
PODS=(baseline-pod privileged-pod hostnetwork-pod hostpid-pod hostipc-pod)
for pod in "${PODS[@]}"; do
  echo -n "Waiting for $pod: "
  while [ "$(kubectl get pod "$pod" -o 'jsonpath={.status.phase}' 2>/dev/null)" != "Running" ]; do
    sleep 2
    echo -n -e "${YELLOW}.${NC}"
  done
  echo -e " ${GREEN}Running!${NC}"
done
echo -e "${GREEN}All pods are running${NC}"
echo ""

# Check capabilities for all pods
for pod in "${PODS[@]}"; do
  check_container "$pod"
done

# Generate a capabilities comparison summary
print_header "CAPABILITIES COMPARISON SUMMARY"
echo "This table shows which critical capabilities are granted to each pod configuration."
echo ""

# Use minikube to generate the comparison
minikube ssh "
  # Define key capabilities to check for
  declare -a CAPS=(
    '21:SYS_ADMIN'
    '12:NET_ADMIN'
    '19:SYS_PTRACE'
    '16:SYS_MODULE'
    '7:SETUID'
    '24:SYS_RESOURCE'
    '17:SYS_RAWIO'
  )
  
  # Print table header with colors
  printf \"\033[1;36m%-20s\" \"POD NAME\"
  for cap in \"\${CAPS[@]}\"; do
    capname=\$(echo \$cap | cut -d':' -f2)
    printf \"\033[1;36m%-12s\" \"\$capname\"
  done
  printf \"\033[0m\\n\"
  
  # Print separator
  printf \"\033[36m%s\" \"--------------------\"
  for cap in \"\${CAPS[@]}\"; do
    printf \"%s\" \"------------\"
  done
  printf \"\033[0m\\n\"
  
  # Check each pod
  for pod in baseline-pod privileged-pod hostnetwork-pod hostpid-pod hostipc-pod; do
    if [ -f \"/tmp/\${pod}_capeff\" ]; then
      capeff=\$(cat /tmp/\${pod}_capeff)
      
      # Convert hex to binary
      binary=\$(printf \"%064s\" \$(bc <<< \"obase=2;ibase=16;\${capeff^^}\") | tr ' ' '0')
      
      # Reverse for bit position
      reversed_binary=\$(echo \$binary | rev)
      
      # Print pod name
      printf \"\033[1;33m%-20s\033[0m\" \"\$pod\"
      
      # Check each capability
      for cap in \"\${CAPS[@]}\"; do
        pos=\$(echo \$cap | cut -d':' -f1)
        
        if [ \"\${reversed_binary:\$pos:1}\" = \"1\" ]; then
          printf \"\033[1;32m%-12s\033[0m\" \"YES\"
        else
          printf \"\033[1;31m%-12s\033[0m\" \"NO\"
        fi
      done
      printf \"\\n\"
    else
      printf \"\033[1;33m%-20s \033[1;31mCAPABILITY DATA NOT FOUND\033[0m\\n\" \"\$pod\"
    fi
  done
"

# Cleanup
print_header "CLEANUP"
echo -e "${YELLOW}Cleaning up resources...${NC}"
for pod in "${PODS[@]}"; do
  kubectl delete pod "$pod" --force --wait=false 2>/dev/null
  show_status "CLEANUP" "OK" "Deleted pod: $pod"
done
minikube ssh "sudo rm -f /tmp/*_capeff"
echo -e "${GREEN}${BOLD}All done! Tests completed successfully!${NC}"

Taking a look at the output of our baseline pod, which isn’t privileged and doesn’t have any pod specifications that alter its namespaces, we can see that this Pod’s container has 14 capabilities.

Output of baseline pod
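
As a rough illustration of how a CapEff value maps to that count, here is a minimal sketch that decodes a hex mask using bash arithmetic. The function name decode_caps and the sample mask 00000000a80425fb are assumptions for illustration: that value is the default mask commonly reported by Docker and containerd for a non-privileged container, and your runtime may report something different.

```shell
#!/usr/bin/env bash
# Capability names indexed by bit position (bits 0-37, the range used in this post).
CAP_NAMES=(
  CAP_CHOWN CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_FOWNER CAP_FSETID
  CAP_KILL CAP_SETGID CAP_SETUID CAP_SETPCAP CAP_LINUX_IMMUTABLE
  CAP_NET_BIND_SERVICE CAP_NET_BROADCAST CAP_NET_ADMIN CAP_NET_RAW
  CAP_IPC_LOCK CAP_IPC_OWNER CAP_SYS_MODULE CAP_SYS_RAWIO CAP_SYS_CHROOT
  CAP_SYS_PTRACE CAP_SYS_PACCT CAP_SYS_ADMIN CAP_SYS_BOOT CAP_SYS_NICE
  CAP_SYS_RESOURCE CAP_SYS_TIME CAP_SYS_TTY_CONFIG CAP_MKNOD CAP_LEASE
  CAP_AUDIT_WRITE CAP_AUDIT_CONTROL CAP_SETFCAP CAP_MAC_OVERRIDE
  CAP_MAC_ADMIN CAP_SYSLOG CAP_WAKE_ALARM CAP_BLOCK_SUSPEND CAP_AUDIT_READ
)

# decode_caps HEXMASK: print one capability name per set bit.
# (Hypothetical helper; HEXMASK is the CapEff value without a 0x prefix.)
decode_caps() {
  local mask=$(( 16#$1 )) bit
  for bit in "${!CAP_NAMES[@]}"; do
    if (( (mask >> bit) & 1 )); then
      echo "${CAP_NAMES[bit]}"
    fi
  done
}
```

Inside a container, something like decode_caps "$(awk '/^CapEff:/ {print $2}' /proc/self/status)" should list that container's effective capabilities. With the assumed sample mask, decode_caps 00000000a80425fb | grep -c . prints 14.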

Interestingly, every other pod (with the exception of the privileged pod, which we will touch on in a moment) has the same capabilities. This demonstrates that setting hostNetwork: true, hostPID: true, or hostIPC: true does NOT grant additional capabilities to the pod; their “enhanced privileges” come instead from the namespace sharing observed previously.

The following summary shows this well. The privileged pod is the only one with sensitive capabilities across the board.

Privileged pod having capabilities

Taking a deeper look at the pod with hostNetwork: true, we can see that the net and uts namespaces are shared, but no additional Linux capabilities are granted.

hostNetwork true pod not having capabilities
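
One way to confirm which namespaces are shared is to compare the namespace links under /proc/<pid>/ns for a container process and a host process such as PID 1. The following is a minimal sketch, meant to run on the node (for example via minikube ssh) with enough privileges to read another process's /proc entries. The helper names ns_id, same_ns, and compare_ns are hypothetical, as is the PID in the usage note.

```shell
#!/usr/bin/env bash
# ns_id LINK: extract the inode number from a namespace link
# such as "net:[4026531992]".
ns_id() {
  local link=$1
  link=${link#*\[}
  printf '%s\n' "${link%\]}"
}

# same_ns LINK_A LINK_B: succeed if both links point at the same namespace.
same_ns() {
  [ "$(ns_id "$1")" = "$(ns_id "$2")" ]
}

# compare_ns PID_A PID_B: report, per namespace type, whether the two
# processes share it. Types that cannot be read are skipped.
compare_ns() {
  local pid_a=$1 pid_b=$2 ns a b
  for ns in net uts ipc pid mnt; do
    a=$(readlink "/proc/$pid_a/ns/$ns" 2>/dev/null) || continue
    b=$(readlink "/proc/$pid_b/ns/$ns" 2>/dev/null) || continue
    if same_ns "$a" "$b"; then
      echo "$ns shared"
    else
      echo "$ns isolated"
    fi
  done
}
```

For example, compare_ns 1 12345 (where 12345 stands in for the host-side PID of the pod's sleep process) should report net and uts as shared for the hostNetwork pod, in line with the results above.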

That brings us to our privileged pod. This pod has essentially every capability assigned to it.

Privileged pod having nearly all capabilities

The privileged Pod’s capabilities explain why it’s considered dangerous from a security perspective. While namespace sharing provides specific access to host resources, the privileged flag grants broad system-level permissions. This allows the Pod’s container to perform scary operations like loading kernel modules, manipulating network devices, or bypassing file permission checks.

We can easily create a script to demonstrate these risks. Pay special attention to demonstration #4, which shows that this pod explicitly cannot (directly) access the host’s PID namespace, as we did not specify the hostPID: true field.

#!/bin/bash
# Set colors for better readability
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
NC='\033[0m' # No Color
echo -e "${YELLOW}Creating a YAML file for a privileged pod...${NC}"
# Create a YAML file for the privileged pod
cat > privileged-pod.yaml << 'EOL'
apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
spec:
  containers:
  - name: sleep-container
    image: busybox:latest
    command: ["sleep", "3600"]
    securityContext:
      privileged: true
  restartPolicy: Never
EOL
echo -e "${GREEN}YAML file created. Creating the privileged pod...${NC}"
kubectl apply -f privileged-pod.yaml
echo -e "${YELLOW}Waiting for pod to be ready...${NC}"
kubectl wait --for=condition=Ready pod/privileged-pod --timeout=60s 2>/dev/null || sleep 15
echo -e "${GREEN}Pod is running. Now we'll demonstrate the risks.${NC}"
# Demonstration 1: Direct device access
echo -e "\n${YELLOW}RISK DEMONSTRATION 1: Direct device access${NC}"
echo -e "Command: kubectl exec privileged-pod -- ls -la /dev/"
kubectl exec privileged-pod -- ls -la /dev/ | head -20
# Demonstration 2: Read /etc/passwd as root
# Note: with no hostPath volume mounted, this is the container image's own file, not the host's
echo -e "\n${YELLOW}RISK DEMONSTRATION 2: Host system access${NC}"
echo -e "Command: kubectl exec privileged-pod -- cat /etc/passwd"
kubectl exec privileged-pod -- cat /etc/passwd
# Demonstration 3: Creating a device node (mknod)
echo -e "\n${YELLOW}RISK DEMONSTRATION 3: Creating device nodes${NC}"
echo -e "Command: kubectl exec privileged-pod -- mknod /dev/malicious-device c 1 1"
kubectl exec privileged-pod -- mknod /dev/malicious-device c 1 1
echo -e "Command: kubectl exec privileged-pod -- ls -la /dev/malicious-device"
kubectl exec privileged-pod -- ls -la /dev/malicious-device

# Demonstration 4: Try to access host PID namespace (will show limitation)
echo -e "\n${YELLOW}RISK DEMONSTRATION 4: No access to host PID namespace by default${NC}"
echo -e "Command: kubectl exec privileged-pod -- ps aux"
kubectl exec privileged-pod -- ps aux
echo -e "\n${YELLOW}Notice: The privileged pod only shows container processes, not host processes.${NC}"
echo -e "${YELLOW}To access host PID namespace, you need to explicitly set hostPID: true in the pod spec.${NC}"

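# Demonstration 5: Mounting filesystems (mount requires CAP_SYS_ADMIN)
# Note: this mounts a fresh procfs, which reflects the pod's own PID namespace rather than the host filesystem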
echo -e "\n${YELLOW}RISK DEMONSTRATION 5: Mounting host filesystem${NC}"
echo -e "Command: kubectl exec privileged-pod -- mkdir -p /mnt/host"
kubectl exec privileged-pod -- mkdir -p /mnt/host
echo -e "Command: kubectl exec privileged-pod -- mount -t proc none /mnt/host"
kubectl exec privileged-pod -- mount -t proc none /mnt/host
echo -e "Command: kubectl exec privileged-pod -- ls -la /mnt/host"
kubectl exec privileged-pod -- ls -la /mnt/host | head -10
echo -e "\n${YELLOW}Cleaning up...${NC}"
kubectl delete pod privileged-pod
rm privileged-pod.yaml
echo -e "${GREEN}Demonstration completed.${NC}"

Output demonstrating the risks of pod capabilities

Extra Resources

  • The list of capabilities a privileged pod has that a non-privileged pod does NOT have:
- CAP_DAC_READ_SEARCH - Bypass file read permission checks
- CAP_LINUX_IMMUTABLE - Set the FS_APPEND_FL and FS_IMMUTABLE_FL flags
- CAP_NET_BROADCAST - Make socket broadcasts and listen to multicasts
- CAP_NET_ADMIN - Perform network administration tasks
- CAP_IPC_LOCK - Lock memory
- CAP_IPC_OWNER - Bypass permissions on message queues and shared memory
- CAP_SYS_MODULE - Load and unload kernel modules
- CAP_SYS_RAWIO - Perform I/O port operations
- CAP_SYS_PTRACE - Trace arbitrary processes
- CAP_SYS_PACCT - Configure process accounting
- CAP_SYS_ADMIN - Perform various system administration operations
- CAP_SYS_BOOT - Use reboot() and kexec_load()
- CAP_SYS_NICE - Raise process nice value and change nice value for arbitrary processes
- CAP_SYS_RESOURCE - Override resource limits
- CAP_SYS_TIME - Set system clock
- CAP_SYS_TTY_CONFIG - Configure TTY devices
- CAP_LEASE - Establish leases on files
- CAP_AUDIT_CONTROL - Configure audit subsystem
- CAP_MAC_OVERRIDE - Override MAC restrictions
- CAP_MAC_ADMIN - Configure MAC
- CAP_SYSLOG - Perform privileged syslog operations
- CAP_WAKE_ALARM - Trigger wake alarms
- CAP_BLOCK_SUSPEND - Block system suspend
- CAP_AUDIT_READ - Read audit log
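
The list above can be reproduced by masking the baseline bits out of the privileged mask. This is a minimal sketch, assuming a baseline mask of 00000000a80425fb and a privileged mask of 0000003fffffffff (bits 0 through 37, matching the capabilities listed in this post). Newer kernels define a few more bits, so your values may differ. caps_added and popcount are hypothetical helper names.

```shell
#!/usr/bin/env bash
# caps_added BASE PRIV: print the hex mask of bits set in PRIV but not in BASE.
caps_added() {
  printf '%016x\n' $(( 16#$2 & ~16#$1 ))
}

# popcount HEXMASK: count the set bits in a hex mask.
# Assumes the top bit (bit 63) is clear, which holds for capability masks.
popcount() {
  local mask=$(( 16#$1 )) count=0
  while (( mask != 0 )); do
    count=$(( count + (mask & 1) ))
    mask=$(( mask >> 1 ))
  done
  echo "$count"
}
```

With the assumed masks, caps_added 00000000a80425fb 0000003fffffffff prints 0000003f57fbda04, and popcount 0000003f57fbda04 prints 24, which matches the number of entries in the list.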

Lab environment

minikube config view && minikube profile list && kubectl get all --all-namespaces
Lab environment details